DRAFT: Parquet 3 metadata with decoupled column metadata #242

Closed
pitrou wants to merge 1 commit into apache:master from pitrou:v3-metadata

pitrou commented May 16, 2024

Parquet 3 metadata proposal

This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands of columns).

Context

This is in the context of the broader "Parquet v3" discussion on the mailing list. A number of possible far-reaching changes are being collected in a document.

It is highly recommended that you read at least that document before commenting on this PR.

Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.

Contents

This PR includes several changes:

  1. a new "Parquet 3" file structure with backwards compatibility with legacy readers
  2. new Thrift structures allowing for decoupled decoding of file-level metadata and column metadata: file metadata is now O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups)
  3. removal of outdated, redundant or undesirable fields from the new structures
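
To give a feel for the complexity claim in item 2, here is a minimal back-of-the-envelope sketch. It is not part of the proposed Thrift definitions: the function names are hypothetical, and it assumes that "cost" can be approximated by the number of metadata entries a reader must decode before it can do anything else (one entry per column chunk in the current layout, one entry per column plus one per row group in the decoupled layout).

```python
def eager_entries(n_columns: int, n_row_groups: int) -> int:
    """Entries decoded up front when every row group embeds
    metadata for each of its column chunks (current layout)."""
    return n_columns * n_row_groups


def decoupled_entries(n_columns: int, n_row_groups: int) -> int:
    """Entries decoded up front when file-level metadata holds one
    entry per column plus one per row group (assumed model of the
    decoupled layout; per-column details are decoded on demand)."""
    return n_columns + n_row_groups


if __name__ == "__main__":
    # Illustrative shapes only, not measurements from real files.
    for n_columns, n_row_groups in [(100, 10), (10_000, 100), (50_000, 1_000)]:
        eager = eager_entries(n_columns, n_row_groups)
        decoupled = decoupled_entries(n_columns, n_row_groups)
        print(
            f"{n_columns} columns x {n_row_groups} row groups: "
            f"eager={eager}, decoupled={decoupled}"
        )
```

Under this simplified model, a file with 10,000 columns and 100 row groups goes from 1,000,000 entries decoded up front to 10,100. Actual byte sizes and decode times depend on the contents of each entry, which this sketch does not model.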

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do
